-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add predicate pushdown (updated) #78
base: main
Are you sure you want to change the base?
Conversation
select_statement->node = std::move(select_node); | ||
return make_uniq<SubqueryRef>(std::move(select_statement), "iceberg_scan"); | ||
vector<Value> structs; | ||
for (const auto &file : data_file_values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The majority of lines changed here are still formatting changes - could you format using our clang-format config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mytherin fixed the formatting, sorry about that!
Also @mike-luabase, you don't need to reopen the PR every time, this makes it harder to review actually since we lose a clear view on the review comments that were made before. If you want to overwrite commits you can just force push to your feature branch |
Yes, understood, won't do that again. |
This test is failing, looking into it
|
@Mytherin test failure should be fixed now |
@samansmink @Mytherin can we trigger the run again? This error doesn't seem related to my changes:
|
@mike-luabase 💚 I'm so excited to see movement on this! I'll be able to delete a bunch of manual partition pruning + read_parquet code in an app of mine! |
@samansmink anything I can do to help here? |
If you could re-add a PR description and address @Mytherin's comment from one of your previous PRs #75 (comment), I will take a more detailed look to review this. Please be considerate that reviews take a lot of time, especially reviews from outside contributors that are touching complicated code. To get your PRs through as quick as possible, I recommend you to:
|
@samansmink I added the description and fixed the formatting! I understand this PR is on the larger side, but I think it's as small as it could be to get predicate pushdown working. I'm more than happy to discuss the PR! I'll drop a note in the Discord. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reducing the changeset - this is easier to review. I still wonder about the changes to the Avro files. In addition - there's no tests included with the changeset. Could you add tests that stress test the predicate pushdown in various ways (e.g. testing out different predicate types, different data types, etc)?
|
||
/* This code was generated by avrogencpp 1.13.0-SNAPSHOT. Do not edit.*/ | ||
|
||
#ifndef MANIFEST_ENTRY_HH_2043678367_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand a bit on why the avro/iceberg_types
files are being changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mytherin the existing types didn't have a way to access lower_bounds
and upper_bounds
, which we need for pushdown.
on it! |
@Mytherin I added tests and updated the data generation for predicate pushdown. Let me know if that works! |
eb8c753
to
136b5eb
Compare
136b5eb
to
d9600d4
Compare
I see files from the |
As mentioned before, @mike-luabase why is this PR removing the checked in test data? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the efforts, Mike. Was looking through it; but I would not call it a review and in any case I think DuckDB Labs folks need to do that. I marked a number of diffs in the PR that I think could be removed: the deletion of the generated files (as already remarked by Sam) and some smaller things.
src/include/iceberg_types.hpp
Outdated
lower_bounds[std::to_string(lb.key)] = lb.value; | ||
} | ||
} else { | ||
fprintf(stderr, "Lower bounds ISSSSS null\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ISSSSS?
is printing to stderr the right way to signal the absence of bounds info in a manifest?
@@ -1,4 +1,4 @@ | |||
#!/usr/bin/python3 | |||
#!/Users/mritchie712/opt/anaconda3/bin/python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will break on other computers. In this file there are some useful changes (partitioning) but also spurious changes likes semicolons and spacing.. It would be nice to remove the spurious changes to make the diff as small as possible
@@ -80,8 +80,8 @@ vector<IcebergManifestEntry> IcebergTable::ReadManifestEntries(const string &pat | |||
} | |||
} else { | |||
auto schema = avro::compileJsonSchemaFromString(MANIFEST_ENTRY_SCHEMA); | |||
avro::DataFileReader<c::manifest_entry> dfr(std::move(stream), schema); | |||
c::manifest_entry manifest_entry; | |||
avro::DataFileReader<manifest_entry> dfr(std::move(stream), schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes are spurious and hence this file change could be removed from the PR
This PR enhances query performance by filtering data at the metadata level, reducing the amount of data read during scans. It currently only addresses queries containing a single Iceberg table.
Key Changes
Extended
IcebergManifestEntry
:lower_bounds
andupper_bounds
maps to store column statistics.Utility Function:
IcebergUtils::GetFullPath
to resolve file paths accurately.Metadata Retrieval:
GetEntries
template inIcebergTable
to fetch relevant manifest entries, excluding deleted ones.Predicate Evaluation:
EvaluatePredicateAgainstStatistics
to assess if data files satisfy query predicates based on their statistics.Scan Expression Modification:
MakeScanExpression
to filterdata_file_entries
using predicates before scanning.Binding Function Enhancements:
IcebergScanBindReplace
with additional logging and prepared data files based on predicate results.This implementation optimizes Iceberg table scans by leveraging metadata for early data filtering, significantly improving query efficiency and resource usage.